
SimpleChat Completion Mode flexibility and cleanup, Settings gMe, Optional sliding window #7480

Merged
19 commits merged into ggerganov:master on May 26, 2024

Conversation

hanishkvc
Contributor

@hanishkvc hanishkvc commented May 22, 2024

In Completion mode

  • now, by default, the logic doesn't insert any role-related "THE_ROLE: " prefix into the message when generating the prompt json entry sent to the server's /completions endpoint. This gives the end user full control over whether to insert a prefix, based on the model they are communicating with in completion mode. Given that the /completion endpoint doesn't add any in-between special tokens, tags, or chat templating of its own, an end user can use the multiline input support to assemble any combination of messages and special tokens they see fit, to check how the model responds. Or they can use it just for a normal completion query.

  • avoid the implicit textarea enter-key newline char in the user input; however, if the end user explicitly wants a newline in the sent message, they can always press shift+enter to insert one.

  • fix a corner case where a system prompt (normally neither expected nor needed in completion mode) would mess with the logic. Now it is up to the end user whether to set a system prompt in completion mode; either way, the logic won't raise an exception.

In general

  • update the readme as well as the usage message to capture the flow better.

  • add a placeholder hint wrt the system prompt

@mofosyne, please wait half a day (12 hours) before you merge this; I will cross-check things once more later today. In the meantime, if anyone is trying to use SimpleChat, the commits in this PR could be useful.

@hanishkvc hanishkvc marked this pull request as draft May 22, 2024 23:31
@mofosyne mofosyne added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label May 23, 2024
@hanishkvc
Contributor Author

Consolidate the global control variables into gMe, to allow developers to modify some of the behaviour at runtime by changing gMe members in the browser's devel-tools console.

Add support for an optional (disabled by default) sliding window controlling how much of the chat history is sent to the server/ai-model when chatting.

Add the configuration fields sent to the /chat/completions or /completions endpoints as members of gMe (chatRequestOptions), instead of hard-coding them in the logic's code flow. This allows these fields to be modified, added to, or deleted from the dev console. Also add a max_tokens field to the mix when communicating with the server endpoints.

Initial skeleton, for the future, for trying to fetch the json data as it is being generated, rather than in one shot at the end, if the server supports that (need to look into the web service api in more detail later).

Update the usage messages shown for a new/fresh chat. Also show the current status of the dev-console-controllable configurations.

To keep things simple, avoid mixing chat and completion messages by clearing once, at the time of switching from chat to completion mode.
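
For illustration, tweaking from the devel-tools console would look something like the sketch below (a minimal example assuming the gMe members described above; the temperature line is just a placeholder for any extra request field one might want to experiment with):

```js
// Adjust SimpleChat behaviour at runtime from the browser devel-tools console.
gMe.chatRequestOptions.max_tokens = 512;    // cap how much the model may generate
gMe.chatRequestOptions.temperature = 0.7;   // add or override any request field...
delete gMe.chatRequestOptions.temperature;  // ...or remove it again
```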

@hanishkvc hanishkvc changed the title SimpleChat Completion Mode flexibility and cleanup SimpleChat Completion Mode flexibility and cleanup, Settings gMe, Optional sliding window May 23, 2024
@hanishkvc hanishkvc marked this pull request as ready for review May 23, 2024 23:13
Just have an alert msg in the html about needing javascript enabled, and
have the usage message come from the js file. Update the usage message a bit.
Also enable switching the session as part of the setup_ui call.

Add a possible system prompt as a placeholder for the system-input.

In completion mode (see the sketch below)

* avoid inserting a "Role: " prefix before each role's message

* avoid inserting a newline at the beginning and end of the prompt
  message. However, if there are multiple role messages, insert a
  newline when going from one role's message to the next role's
  message.
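
A minimal sketch of that prompt assembly, assuming the messages are kept as an array of {role, content} entries (illustrative only, not the exact SimpleChat code):

```js
// Build the /completions prompt: no "ROLE: " prefixes, no leading or trailing
// newline, but a newline between consecutive role messages.
function buildCompletionPrompt(messages) {
    return messages.map((m) => m.content).join("\n");
}
```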
Readme update wrt completion mode behavior.

Usage help updated wrt completion mode behavior.

When changing the user-input element from input to textarea, the last
newline at the end of the textarea's user input was not being filtered
out; this is fixed now. However, if the user wants an explicit newline,
they can use shift+enter to insert one, and that won't be removed. The
extra-newline removal logic uses substring and keyup to keep things
simple and avoid some previously noted bugs with other events in the
key path as well as IME composition and the like.
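
A rough sketch of that keyup-based trimming, assuming a textarea element and a hypothetical submit helper (not the actual PR code):

```js
// On a plain Enter the textarea ends up with one extra trailing newline just
// before submit; strip only that one, so Shift+Enter newlines inside the
// message are preserved.
const userInputEl = document.getElementById("user-input"); // assumed element id
userInputEl.addEventListener("keyup", (ev) => {
    if (ev.key === "Enter" && !ev.shiftKey) {
        let msg = userInputEl.value;
        if (msg.endsWith("\n")) {
            msg = msg.substring(0, msg.length - 1);
        }
        handleUserSubmit(msg); // hypothetical helper standing in for the real submit flow
    }
});
```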
The previous logic would clear/reset xchat without doing the same
for iLastSys, leaving it pointing to a now non-existent role-content
entry.

So if a user set a system prompt and used completion mode, this
half-done clear would occur after the model response was received.
In turn, when the user tried to send a new completion query,
handle_user_submit would try to add/update the system prompt, if any,
which would fail, because iLastSys would still be pointing to a
non-existent entry.

This is fixed now by having a proper clear helper in the SC class.
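
A minimal sketch of such a clear helper, assuming the SimpleChat (SC) class keeps the messages in xchat and the latest system-prompt index in iLastSys (member names taken from the description above):

```js
class SimpleChat {
    constructor() {
        this.xchat = [];     // array of {role, content} entries
        this.iLastSys = -1;  // index of the latest system entry, -1 if none
    }
    /** Reset the chat history and the cached system-prompt index together. */
    clear() {
        this.xchat = [];
        this.iLastSys = -1;
    }
}
```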
Previously, any chat history, including the model's response to a
completion query, would get cleared after being shown to the user,
at the end of handle_user_submit, rather than at the beginning.

This gave the flexibility that a user could switch from chat mode
to completion mode and have the chat history up to that point sent
to the ai model as part of the completion query. However, this flow
also had the issue that if the user switched between different chat
sessions after getting a completion response, they could no longer
see the completion query and the response they had just got.

The new flow moves the clearing of chat history in completion mode
to the beginning of handle_user_submit, so that the user doesn't
lose the last completion-mode query and response until a new
completion-mode query is sent to the model, even if they switch
between chat sessions. At the same time, the loss of flexibility
around implicitly converting previous chat history into part of
the completion query doesn't matter, because the end user can now
enter multiline queries.
For later

The server flow doesn't seem to be sending back data early, at least
for the request (including options) that is currently sent.

If it becomes possible to read the json data early, as and when the
ai model is generating it, then this helper needs to indirectly update
the chat div with the received data, without waiting for the overall
data to be available.
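
For reference, incrementally reading a fetch response would look roughly like the sketch below (a generic Streams API pattern, not the actual skeleton in the PR; whether the server sends partial data for the current request options is a separate question, as noted above):

```js
// Feed each decoded chunk to a callback (e.g. one that appends to the chat div)
// as it arrives, instead of waiting for the full body.
async function readResponseIncrementally(resp, onChunk) {
    const reader = resp.body.getReader();
    const decoder = new TextDecoder();
    while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        onChunk(decoder.decode(value, { stream: true }));
    }
}
```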
Keep the title simple so that the print file name doesn't have chars
that need to be removed.

Update readme wrt some of the new helpers and options.

Change the Usage list to a list of lists, add a few items and style it
to reduce the margin around the lists.

As sometimes, based on the query from the user, the ai model may get
into a runaway kind of generation with repetitions and the like, add
max_tokens to try to limit this runaway behaviour, if possible.

This allows the end user to see the settings used by the logic, as
well as change/update those settings if they want to, using the
devel-tools/console.

This is disabled by default. However, if enabled, then in addition to
the latest system message, only the last N user messages after the
latest system message, and their responses from the ai model, will be
sent to the ai-model when querying for a new response.

This specified N also includes the latest user query.
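
A simplified sketch of that sliding-window selection (illustrative only; the member and function names here are assumptions, not the real SimpleChat identifiers):

```js
// Keep the latest system message plus the last nUserMsgs user messages
// (and the assistant responses that follow them); nUserMsgs includes the
// latest user query. A negative nUserMsgs means "send everything".
function applySlidingWindow(xchat, iLastSys, nUserMsgs) {
    if (nUserMsgs < 0) return xchat.slice();
    const head = iLastSys >= 0 ? [xchat[iLastSys]] : [];
    const tail = xchat.slice(iLastSys + 1);
    let usersSeen = 0;
    let start = tail.length;
    for (let i = tail.length - 1; i >= 0; i--) {
        if (tail[i].role === "user") {
            usersSeen += 1;
            start = i;
            if (usersSeen >= nUserMsgs) break;
        }
    }
    return head.concat(tail.slice(start));
}
```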
@hanishkvc hanishkvc force-pushed the hkvc_webfrontend_simplechat_v2 branch from 45eb798 to 11d2d31 Compare May 23, 2024 23:28
@hanishkvc
Contributor Author

hanishkvc commented May 23, 2024

Hi @mofosyne,

This set of updates/commits is ready for merging with main/master.

[Look at next comment]

@hanishkvc
Contributor Author

hanishkvc commented May 24, 2024

Hi @ggerganov @mofosyne @ngxson @teleprint-me / all

Based on some testing across 7b-and-below models (Llama3, Mistral v02, Phi3, Phi2) on an 8GB laptop, what I notice is that

  • [A] confining the context sent to the server/model to system + last-query-response + current-query seems to generally lead to decent enough responses from the model, with an additional max_tokens=512 limiting any runaway garbage/repeat text generation. This also means that runaway garbage text from past chat generally doesn't affect anything beyond the immediately next question.

  • [B] sending the full previous chat history to the server/ai-model may lead to slower responses, as well as more chances of runaway garbage/repeat text at the end for some models and queries. Setting max_tokens limits the runaway garbage to the specified limit (wrt the /chat/completions endpoint), but it may also more often cause the answer generation to be chopped/partial in subsequent questions.

So do you want the default context window sent from the client side to the server to be limited to system + last-query-response + cur-query, or do you want the client to always blindly send the full chat history of a given chat session?

For a developer, keeping the client-side context-window-limiting logic set to unlimited may be useful for testing server/ai-model implications, while for simple/normal end-user usage, keeping the client-side context window limited to system + last-query-response + cur-query may make some sense.

Users also have the option of using NewCHAT to start a new chat session.

NOTE: The implemented client-side context-window-limiting logic is flexible: one can control the amount of chat history sent in terms of the latest system prompt update in a given chat session, and how much of the latest user queries + assistant responses after that should be included, or whether the full history should be included. But the two possibilities I mentioned above illustrate the issue and a possible solution in a simple and straight way.

NOTE: @ngxson the server currently doesn't seem to honor max_tokens wrt the /completions endpoint, while the /chat/completions endpoint does seem to honor it. This angle could also help fine-tune the CI test failure which was seen wrt server testing, based on a quick glance at what I saw when it was noticed on the original simplechat PR.

@ngxson
Collaborator

ngxson commented May 24, 2024

> the server currently doesn't seem to honor max_tokens wrt the /completions endpoint, while the /chat/completions endpoint does seem to honor it. This angle could also help fine-tune the CI test failure which was seen wrt server testing, based on a quick glance at what I saw when it was noticed on the original simplechat PR.

In the /chat/completions endpoint, we use the field name max_tokens because it provides compatibility with the OAI API.

The plain /completions endpoint is not OAI-compatible; the field name there is n_predict.
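
In other words (illustrative request bodies, trimmed down to just the length-limit fields):

```js
// OAI-compatible chat endpoint: the generation length is capped via max_tokens.
const chatBody = { messages: [/* chat messages */], max_tokens: 1024 }; // POST /chat/completions

// Plain llama.cpp completion endpoint: the equivalent field is n_predict.
const completionBody = { prompt: "some prompt text", n_predict: 1024 }; // POST /completions
```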

@ngxson
Collaborator

ngxson commented May 24, 2024

> sending the full previous chat history to the server/ai-model may lead to slower responses, as well as more chances of runaway garbage/repeat text at the end for some models and queries.

Regarding this point, maybe you can enable cache_prompt: true for each request. This allows you to re-use the KV cache, so when you add a new message to the chat history, it won't need to re-evaluate all the old messages.

However, this approach only works when you add more messages to the history. It won't work with a sliding window, because we remove earlier messages from the history. Please have a look at #5793 for more details.
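
For example, a sketch of a request with prompt caching enabled (an illustrative body against the plain /completions endpoint; the base URL and wrapper function are assumptions):

```js
// Ask the server to keep/reuse the prompt's KV cache, so the unchanged leading
// part of the prompt does not need to be re-evaluated on the next request.
async function queryWithPromptCache(serverBaseUrl, prompt) {
    const resp = await fetch(`${serverBaseUrl}/completions`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            prompt,
            n_predict: 1024,
            cache_prompt: true,
        }),
    });
    return resp.json();
}
```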

@hanishkvc
Contributor Author

hanishkvc commented May 24, 2024

Hi @ngxson @ggerganov

I will check out the behaviour with cache_prompt (with and without max_tokens); at the same time, I am not fully sure this will solve the core issue, which is to an extent caused by the model's context window.

#5793, I assume you are mentioning wrt the future, as it seems to be paused for now? Also, from a casual look at things mentioned in it, the current caching (as well as the suggestion discussed there) is mainly useful for the common case of having the core user-entered context in the system prompt, and/or a system prompt + first few fixed handshake messages (which can indirectly be taken as a larger system prompt), followed by a changing tail in the query/chat sent. Either way, that also seems to favour a large system prompt along with a client-side sliding window mechanism.

> sending the full previous chat history to the server/ai-model may lead to slower responses, as well as more chances of runaway garbage/repeat text at the end for some models and queries.
>
> Regarding this point, maybe you can enable cache_prompt: true for each request. This allows you to re-use the KV cache, so when you add a new message to the chat history, it won't need to re-evaluate all the old messages.
>
> However, this approach only works when you add more messages to the history. It won't work with a sliding window, because we remove earlier messages from the history. Please have a look at #5793 for more details.

Reduce the chat history context sent to the server/ai-model to be
just the system-prompt, prev-user-request-and-ai-response and
cur-user-request, instead of the previous full chat history.
This way, if there is any response with garbage/repetition, it
doesn't mess with things beyond the next question, in some ways.

Increase max_tokens to 1024, so that a relatively large previous
response doesn't eat up the space available for the next
query-response. However, don't forget that the server should also
be started with a model context size of 1k or more, to be on the
safe side.

Add frequency and presence penalty fields set to 1.2 to the set
of fields sent to the server along with the user query, so that
the model is partly set up to try to avoid repeating text in its
response.

The /completions endpoint of examples/server doesn't take max_tokens;
instead it takes the internal n_predict. For now, add the same on
the client side; maybe later add max_tokens to the /completions
endpoint handling.
@hanishkvc
Contributor Author

hanishkvc commented May 24, 2024

Hi @ngxson @ggerganov

Did test with cache_prompt; it didn't change things much. Also, I didn't specifically check for any speed impact; rather, the garbage phase will eat up the time. As before, max_tokens ensures that garbage text repetition doesn't run away.

Additionally, the presence/frequency penalty also didn't help in avoiding garbage text repetition for the queries which generated it.

Tested the above two aspects for now with Llama3 8b.

As a sum total of all this, from an overall perspective, for now I have set up the defaults in SimpleChat, i.e. on the client side, such that

  1. by default it sends only system-prompt + immediately-previous-user-request-and-ai-response + cur-request (i.e. chat mode)
    • even the chat displayed will update accordingly.
  2. max_tokens is set to 1024 (n_predict is also added, set to the same)
    • the server by default may set the model ctxt size to 512 tokens or so; one may want to modify that as well when running it
  3. frequency/presence penalty is set to 1.2

The developer/end-user is shown this info whenever a new chat is started. And they can modify these if required, by changing gMe from the browser's devel-tools/console, for now (see the sketch below).
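
Roughly, a sketch of how those defaults might look on gMe (chatRequestOptions is the member named earlier in this PR; the sliding-window member name here is an assumption, not the real identifier):

```js
gMe.chatRequestOptions = {
    max_tokens: 1024,          // honoured by /chat/completions
    n_predict: 1024,           // equivalent field for /completions
    frequency_penalty: 1.2,
    presence_penalty: 1.2,
};
gMe.slidingWindowUserMsgCount = 2; // assumed name: system + prev query/response + current query
```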

@teleprint-me
Contributor

teleprint-me commented May 24, 2024

@hanishkvc The context can be dynamically managed. Another possibility is to make a dynamic upper limit defined by the user. Artificially limiting the context seems undesirable.

@hanishkvc
Contributor Author

hanishkvc commented May 24, 2024

Hi @teleprint-me @ggerganov @ngxson

Currently llama.cpp/server implicitly limits the model context size. What I have seen is that by default it is 512 tokens, i.e. if one runs the server without a -c CONTEXT_SIZE argument. And either way, independent of llama.cpp, most of the current models have relatively small implicit context-size limits. Things are moving towards larger implicit ctxt sizes or sliding ctxt windows on the model side, but that is not fully evolved nor widespread yet.

In turn, what I have seen is that when the [system-prompt + chat-history + current-query] plus the response generated for the current query together exceed this context size, it usually leads to repeating garbage text, which seems to stop implicitly once everything, including the garbage text, reaches around 8K tokens (some limit potentially enforced by llama.cpp or so; I haven't looked into it), which in turn takes quite a lot of time on laptops without a GPU. Using the max_tokens hint from the client side forces the server to stop once that many tokens have been generated by the ai-model, irrespective of whether it indicates the end of generation or not.

Given that the client has no control over the model's context size, nor over the question/query the end user will ask, which may lead to a long answer that in turn can exceed the context size, that is the reason I am forcing the max_tokens limit. It is also the reason I implemented the client-side sliding-window concept, to try to ensure that the context loaded by system + history + query doesn't occupy too much of the context size on average.

Currently I haven't added a UI specifically for setting it and the other config parameters. However, I expose most of the logic/behaviour-controlling parameters through a global variable gMe, so that one can modify them from the browser's devel-tools/console.

Maybe I will add a UI to control these settings. I originally started on SimpleChat more to test some logic wrt llama.cpp and in turn the server. So the idea has been to keep the code basic and minimal, so that anyone wanting to test llama.cpp and the server doesn't have to dig through the code too much to change things.

Also, I haven't looked into the llama.cpp and server code flow in detail yet, only specific parts when I faced some issue, so I am not fully sure of the llama and server flow. At the same time, I assume that to accommodate both fixed-context and sliding-context models in a straight and simple way, llama.cpp and the server currently don't implicitly stop when the context size is reached, but instead depend on max_tokens/n_predict for when to force a stop.

> @hanishkvc The context can be dynamically managed. Another possibility is to make a dynamic upper limit defined by the user. Artificially limiting the context seems undesirable.

Collaborator

@mofosyne mofosyne left a comment


Just did a quick code review. Looks alright, but will wait for one other reviewer for a bit before merging.

Regarding the targeted audience: I would think ordinary people would prefer the standard interface. As you said, this is more targeted towards developers, so the feature set should be lean and mean and focus on the explainability of the code itself. If anything, you may want to emphasise that point in the code, so developers don't try to feature-creep it.

@hanishkvc
Contributor Author

> Just did a quick code review. Looks alright, but will wait for one other reviewer for a bit before merging.
>
> Regarding the targeted audience: I would think ordinary people would prefer the standard interface. As you said, this is more targeted towards developers, so the feature set should be lean and mean and focus on the explainability of the code itself. If anything, you may want to emphasise that point in the code, so developers don't try to feature-creep it.

The idea is to be easy enough to use for basic purposes, while also being simple and easily discernible by developers who may not come from a web-frontend background (and so may not be familiar with template/end-use-specific-language-extension driven flows), so that they can use it to explore/experiment with things.

Because there is some amount of experimentation currently going on with the default interface exposed by the server, chances are that, for now, this may be more approachable for ordinary users as well.

And given that the idea is also to help developers explore/experiment, some flexibility is provided to change behaviour easily using the devel-tools/console for now. And skeletal logic has been implemented to explore some of the endpoints and the ideas/implications around them.

@mofosyne mofosyne added the merge ready indicates that this may be ready to merge soon and is just holding out in case of objections label May 25, 2024
@mofosyne
Collaborator

Sounds reasonable enough. Let's see if we get any positive assessment or objection, otherwise I'll merge it in a few days or so.

@hanishkvc
Contributor Author

hanishkvc commented May 25, 2024

Hey @mofosyne,

Please do go ahead and merge this in; this is a logical bunch of straightforward cleanup and improvements to the initial SimpleChat.

If possible, I want to start a new PR on an optional, simple histogram-based automatic garbage identification and trimming logic, and in turn dial down the frequency/presence penalty, because those penalties just make the garbage text repetition more difficult to identify: in some of the models like llama3, it seems to over-intelligently repeat the text with small variations between the repetitions, making it difficult to auto-identify the garbage.

And also maybe a simple setup UI.

But I want to branch it from the merged master here, rather than from this PR branch, and finish it tomorrow or so, so that I can get back to other things.

@mofosyne
Collaborator

mofosyne commented May 26, 2024

@hanishkvc okay, had a second look over. Considering the isolated changes to an isolated subsystem, I'm happy to merge this in so you can proceed with your next idea.

But just gonna double check: do you really want to add these extra features, or is this feature creep? What is your justification for adding these to this 'simple' UI, rather than placing them in the main tool, for instance? You need to balance the code's objective of being readable against the code acting as a developer tool. If needed and you can justify it, should we make a 'dev tool' UI folder and keep this as a stripped-down demo?

@mofosyne mofosyne merged commit b9adcbb into ggerganov:master May 26, 2024
22 checks passed
@hanishkvc
Contributor Author

hanishkvc commented May 26, 2024

@mofosyne I agree in principle,

My idea is to keep the settings-related code confined to the Me class (which maintains the global configurable parameters and shows them) or an equivalent new class/file, and the dumb garbage trimming in its own helper class/file, with minimal change to the existing code flow and a bit more cleanup/structuring, i.e. moving some existing helpers into their own files, like a ui file or so.

So the end goal is still to keep it simple, and potentially even more structured, while also hopefully making it easier to use some of its flexibility.

@hanishkvc
Contributor Author

PR #7548 extends SimpleChat with a simple-minded garbage-trimming logic and a minimal settings UI.

@khimaros
Contributor

@hanishkvc I suspect your new UI may be running into the same caching issues that affect other third-party clients, as described in #7185.

@hanishkvc
Contributor Author

hanishkvc commented Jun 14, 2024

Hi @khimaros

I had looked at prompt_cache some time back. However, because I use a relatively fixed system prompt plus a sliding window into the user chat flow, the current prompt caching, which expects a contiguous history (for its own valid reasons), won't help much from a cached-data re-use perspective.

But at the same time, technically there shouldn't be any harm in keeping prompt caching enabled, so that for now the system-prompt part (and in future maybe also the sliding user-chat part, when the caching logic acquires that capability) can get reused in an appropriate way. I will send a patch for SimpleChat keeping it enabled, as well as adding a delete button for when there is chat history available from a previous session that one may want to remove, tomorrow.
